This document was updated on 2017-10-03.


Intro

  • Welp. Let's just get through all the code examples.
  • I spend too long talking if I start, so I'm just going to do the code and come back to add notes if I need to.
  • The book has the best notes. : )
  • Really, this doc is just a reference for me to use later
library(stringr)
library(readr)
library(dplyr)
library(ggvis)
library(ISLR)

2.1 What is Statistical Learning?

In essence, statistical learning refers to a set of approaches for estimating \(f\).

  • \(f\) represents the systematic information that \(X\) provides about \(Y\).
  • \(Y\) is a quantitative response that is observed. \(X\) is a predictor that we believe explains some of \(Y\)'s value.
  • We may have many predictors that we assume have a relationship with \(Y\).
  • There is also some random variability or error in \(Y\) which we call \(\epsilon\).
  • \(f\) is some fixed but unknown function of \(X_1,...,X_p\), and \(\epsilon\) is a random error term, which is independent of \(X\) and has mean zero.
  • We write this generally as …

\[Y = f(X) + \epsilon \] It may be easier to think of this with an example:

  • Say we are a consultant to an advertising firm and our task is to provide advice on how to improve sales of a particular product.
  • There is an advertising data set consisting of sales of the product in 200 different markets, along with advertising budgets for three different media: TV, radio, and newspaper.
  • Our client cannot directly manipulate the sales \(Y\).
  • Our goal is to come up with an accurate model \(f\) that can be used to predict sales \(Y\) based on the advertising budgets \(X\).
  • Then we can recommend how they can adjust \(X\) to increase \(Y\).
  • We will represent the budgets as \(X_1\) for TV, \(X_2\) for radio, etc. These are called predictors, independent variables, input variables, features, or just variables.
  • Sales \(Y\) is called the output variable, response, or dependent variable.
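To make the notation concrete, here is a tiny simulation in the spirit of the advertising example. The numbers and the linear form of \(f\) are made up for illustration; this is not the book's Advertising data.

```r
set.seed(123)
# X: TV ad budget in thousands of dollars, for 200 markets (made-up)
tv <- runif(200, min = 0, max = 300)
# f: the true (normally unknown) relationship between budget and sales
f <- function(x) 5 + 0.05 * x
# epsilon: random error, independent of X, with mean zero
eps <- rnorm(200, mean = 0, sd = 2)
# Y = f(X) + epsilon
sales <- f(tv) + eps
```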

2.1.1 Why estimate \(f\)?

Two main reasons: prediction and inference

Prediction:

  • For example, suppose we have the characteristics \(X\) of a patient’s blood sample from a lab. \(Y\) is a variable encoding the patient’s risk for a severe adverse reaction to a particular drug.
  • We seek to predict \(Y\) using \(X\) so we can say whether it's okay to give this patient the drug or not.
  • It's important to note that our prediction \(\widehat{Y}\) will never be exactly right. There is always some error.
  • There are two kinds of error, reducible error and irreducible error.
  • The reducible error
    • This is what we can reduce if we improve the accuracy of our model, \(\widehat{f}\). We try to find the best statistical learning technique to create the most accurate model to estimate \(f\).
  • irreducible error
    • This is the natural variability in the response. Even if we have a perfect model for \(f\), there will still be some error in it. \(\epsilon\) cannot be predicted with \(X\). The error may come from unmeasurable variation in the drug, say from manufacturing variation, or from a variable that would help us predict \(Y\) but that we have not measured in the blood, so we cannot use it.
  • If our goal is prediction, we don't really care about the makeup of our model \(\widehat{f}\); we just care that it results in accurate predictions. In this case more complicated models that do not explain how each variable affects \(Y\) can be used. We call these 'black box' models.
  • A prediction question would be: Which individuals are likely to respond positively to a mailing advertisement based on their demographic information?
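A quick simulated sketch (my own toy example, not from the book) of why the irreducible error puts a floor under prediction accuracy, even for a perfect model:

```r
set.seed(11)
x <- runif(1000)
eps <- rnorm(1000, sd = 0.3)       # irreducible error, Var(eps) = 0.09
y <- sin(2 * pi * x) + eps         # the true f here is sin(2*pi*x)
# Even f-hat = f (no reducible error) leaves roughly Var(eps) behind:
mse_perfect <- mean((y - sin(2 * pi * x))^2)
# A crude f-hat (just the mean of y) also carries reducible error:
mse_crude <- mean((y - mean(y))^2)
mse_perfect  # close to 0.09, the irreducible floor
mse_crude    # larger: improving the model can shrink this part
```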

Inference:

  • Sometimes we just want to know how the variables \(X\) impact the response \(Y\), not to predict \(Y\).
  • In this case we want to use models that let us explain how the variables \(X\) affect the response \(Y\). A complicated black box model does not help us understand what is going on. Certain types of models are more useful here. They may not be as accurate as others for predicting \(Y\), but they explain how things impact \(Y\), which is what matters most. We need to know the form of \(\widehat{f}\).
  • Here we may want to find the small set of predictors that are substantially associated with Y. We may want to know how each predictor affects \(Y\). Can we use a linear model or do we need something more complex?
  • A good example of inference is the question: What effect will changing the price of a product have on sales?
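A hedged sketch of that price question on simulated data (the true -3 effect is invented): fitting an interpretable linear model and reading off the coefficient is the inference step.

```r
set.seed(5)
price <- runif(100, 1, 10)
# Simulated truth: each unit of price costs 3 units of sales
sales <- 50 - 3 * price + rnorm(100, sd = 2)
fit <- lm(sales ~ price)
coef(fit)  # the estimated price effect should come out near -3
```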

This is the key take away:

Depending on whether our ultimate goal is prediction, inference, or a combination of the two, different methods for estimating f may be appropriate. For example, linear models allow for relatively simple and interpretable inference, but may not yield as accurate predictions as some other approaches. In contrast, some of the highly non-linear approaches that we discuss in the later chapters of this book can potentially provide quite accurate predictions for Y , but this comes at the expense of a less interpretable model for which inference is more challenging.

2.1.2 How do we estimate \(f\)?

We use the data:

  • We will always assume that we have observed a set of \(n\) different data points.
  • These observations are called the training data because we will use these data points to train (or teach) our model how to estimate \(f\).
  • We want to find a function \(\hat{f}\) such that \(Y \approx \hat{f}(X)\)
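A minimal example of that idea, with simulated training data and lm() standing in for the statistical learning method:

```r
set.seed(7)
# Simulated training data: true f(x) = 2 + 3x, plus noise
train <- data.frame(x = runif(50))
train$y <- 2 + 3 * train$x + rnorm(50, sd = 0.5)
f_hat <- lm(y ~ x, data = train)               # estimate f from the training data
predict(f_hat, newdata = data.frame(x = 0.5))  # Y-hat, close to f(0.5) = 3.5
```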

2.1.3 The trade-off between prediction accuracy and model interpretability

2.1.4 Supervised versus unsupervised learning

2.1.5 Regression versus classification problems

2.2 Assessing Model Accuracy

2.2.1 Measuring the quality of fit

2.2.2 The bias-variance trade-off

2.2.3 The classification setting

2.3 Lab: Intro to R

2.3.1 Basic Commands

  • c() combines its inputs into a vector
x <- c(1,3,2,5)
x
## [1] 1 3 2 5
  • The = symbol works too
y = c(1,4,3)
y
## [1] 1 4 3
  • Use ?c to get help with c() or any other function.
  • The ‘+’ operator will add vectors element-wise (they must have the same length)
  • ?length to see what the length function does.
x <- c(1,6,2)
y <- c(1,4,3)
length(x)
## [1] 3
length(y)
## [1] 3
x + y
## [1]  2 10  5
  • ls() allows you to look at a list of all objects.
  • rm() is used to delete objects
    • I rarely do this.
ls()
## [1] "x" "y"
rm(x,y)
ls()
## character(0)
  • Remove all objects
rm(list=ls())
  • matrix() will create a matrix
    • it needs 3 arguments, the data, the # of rows, and # of columns
    • by default it will create the matrix by columns, left to right
x = matrix(data=c(1,2,3,4), nrow = 2, ncol = 2)
x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4
  • you don’t need to give the names of the arguments
  • you can change any of the default arguments, for example build a matrix by row rather than by column
  • if you do not assign the matrix to a variable, it will just create it and print it to the screen
x = matrix(c(1,2,3,4), 2, 2, byrow=T)
x
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
  • sqrt() will work on each element of the matrix
  • x^2 will square each element of the matrix
sqrt(x)
##          [,1]     [,2]
## [1,] 1.000000 1.414214
## [2,] 1.732051 2.000000
x^2
##      [,1] [,2]
## [1,]    1    4
## [2,]    9   16
  • rnorm creates a vector of random values sampled from the normal distribution
  • we get a different sample each time we call the function
  • we can change the values of the normal distribution.
  • By default the mean is zero and sd is 1
  • cor will compute the correlation between two vectors
x = rnorm(50)
y = x + rnorm(50, mean=50, sd = .1)
cor(x,y)
## [1] 0.9962304
  • set.seed will allow you to create the exact same set of random numbers
  • this is great for teaching as it lets others get the same set of random numbers in the following functions
set.seed(1303)
rnorm(50)
##  [1] -1.1439763145  1.3421293656  2.1853904757  0.5363925179  0.0631929665
##  [6]  0.5022344825 -0.0004167247  0.5658198405 -0.5725226890 -1.1102250073
## [11] -0.0486871234 -0.6956562176  0.8289174803  0.2066528551 -0.2356745091
## [16] -0.5563104914 -0.3647543571  0.8623550343 -0.6307715354  0.3136021252
## [21] -0.9314953177  0.8238676185  0.5233707021  0.7069214120  0.4202043256
## [26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
## [31]  1.5732737361  0.0127465055  0.8726470499  0.4220661905 -0.0188157917
## [36]  2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412  1.3677342065
## [41]  0.2640073322  0.6321868074 -1.3306509858  0.0268888182  1.0406363208
## [46]  1.3120237985 -0.0300020767 -0.2500257125  0.0234144857  1.6598706557
  • mean calculates the mean of a vector of numbers
  • var calculates the variance of a vector of numbers
set.seed(3)
y = rnorm(100)
mean(y)
## [1] 0.01103557
var(y)
## [1] 0.7328675
# calculate standard deviation
sqrt(var(y))
## [1] 0.8560768
sd(y)
## [1] 0.8560768

2.3.2 Graphics

  • I will probably use ggvis for all plots rather than base R.
  • It is a big improvement in grammar and functionality
  • There may be some special R plots I need to use occasionally, but it seems to be rare.
x = rnorm(100)
y = rnorm(100)
plot(x,y)

plot(x,y, 
     xlab = "this is the x-axis",
     ylab = "this is the y-axis",
     main = "Plot of X vs Y"
     )

  • similar plot but with ggvis
data.frame(x=rnorm(100), y=rnorm(100)) %>% 
  ggvis(~x, ~y) %>%
  layer_points(fill := "blue")
  • you can save the output of an R plot
  • you can use pdf() or jpeg()
  • system() lets you run a system command on the computer
    • here I make a separate folder to put images in
    • dev.off() says we are done creating the plot
  • these steps are useful for sharing analysis results as images in an automated way
    • once you have created an analysis and explained it, it may be useful to send updated results on a daily basis to a location.
    • images are nice because they can easily be shared in an iframe or flow and be quickly viewed
system("mkdir images")
pdf("images/Figure.pdf")
plot(x,y,col="green")
dev.off()
## quartz_off_screen 
##                 2
  • seq() creates a sequence of numbers (?seq for help)
  • There are many options to control length and step size
x = seq(1:10)
x
##  [1]  1  2  3  4  5  6  7  8  9 10
## This is a shortcut
x = 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x = seq(-pi, pi, length=20)
x
##  [1] -3.1415927 -2.8108987 -2.4802047 -2.1495108 -1.8188168 -1.4881228
##  [7] -1.1574289 -0.8267349 -0.4960409 -0.1653470  0.1653470  0.4960409
## [13]  0.8267349  1.1574289  1.4881228  1.8188168  2.1495108  2.4802047
## [19]  2.8108987  3.1415927
  • contour creates a contour plot of 3-dimensional data (like a topographic map)
  • it takes x, y, and z arguments; z gives the value at each (x, y) position
  • there are many arguments to fine tune the plot
y = x
f = outer(x, y, function(x,y) cos(y)/(1+x^2))
contour(x,y,f)

contour(x,y,f,nlevels=45, add=T)

fa = (f-t(f))/2
contour(x,y,fa, nlevels=15)

  • image works like contour, but it uses colors like a heatmap
  • persp creates a 3D plot. theta and phi control the viewing angle
image(x,y,fa)

persp(x,y,fa)

persp(x,y,fa, theta=30)

persp(x,y,fa, theta=30, phi=20)

persp(x,y,fa, theta=30, phi=70)

persp(x,y,fa, theta=30, phi=40)


2.3.3 Indexing Data

  • The book just covers matrices
  • Indexing works similarly for vectors and dataframes.
  • I will mostly be working with dataframes so I added some examples here
A = matrix(1:16,4,4)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
A[2,3]
## [1] 10
A[c(1,3), c(2,4)]
##      [,1] [,2]
## [1,]    5   13
## [2,]    7   15
A[1:3,2:4]
##      [,1] [,2] [,3]
## [1,]    5    9   13
## [2,]    6   10   14
## [3,]    7   11   15
A[1:2,]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
A[,1:2]
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8
A[1,]
## [1]  1  5  9 13
A[-c(1,3),]
##      [,1] [,2] [,3] [,4]
## [1,]    2    6   10   14
## [2,]    4    8   12   16
dim(A)
## [1] 4 4
  • All of these work on a data frame also
  • dataframes can also be indexed by column names which reads a little cleaner
df <- data.frame(a=1:4, b=5:8, c=9:12, d=13:16)
df
##   a b  c  d
## 1 1 5  9 13
## 2 2 6 10 14
## 3 3 7 11 15
## 4 4 8 12 16
df[2,3]
## [1] 10
df[c(1,3), c(2,4)]
##   b  d
## 1 5 13
## 3 7 15
df[1:3,2:4]
##   b  c  d
## 1 5  9 13
## 2 6 10 14
## 3 7 11 15
df[1:2,]
##   a b  c  d
## 1 1 5  9 13
## 2 2 6 10 14
df[1:2] ## it defaults to the columns if you do not put in the ,
##   a b
## 1 1 5
## 2 2 6
## 3 3 7
## 4 4 8
df[,1:2]
##   a b
## 1 1 5
## 2 2 6
## 3 3 7
## 4 4 8
df[1,]
##   a b c  d
## 1 1 5 9 13
df[-c(1,3),]
##   a b  c  d
## 2 2 6 10 14
## 4 4 8 12 16
dim(df)
## [1] 4 4
df$a
## [1] 1 2 3 4
df[c('a','b')]
##   a b
## 1 1 5
## 2 2 6
## 3 3 7
## 4 4 8
df[1:2, c('a','b')]
##   a b
## 1 1 5
## 2 2 6
df[c(1,4), c('a','d')]
##   a  d
## 1 1 13
## 4 4 16

2.3.4 Loading Data

  • I will probably only use readr since it is a newer library that replaces a lot of the code in this section
  • The Auto dataset is already part of the ISLR library so I don’t need to load it
  • fix opens a spreadsheet viewer but it's really slow
    • This has been replaced by the variable viewer built into RStudio
  • read_csv is the readr function that replaces read.csv. It has many arguments to customize loading the data, but its defaults do a great job and make life easy. In most cases it does exactly what you would want.
## This just needs to be run the first time
# system("mkdir data")
# write_csv(Auto, "data/Auto.csv")

df_auto <- read_csv("data/Auto.csv")
glimpse(df_auto)
## Observations: 392
## Variables: 9
## $ mpg          <dbl> 18, 15, 18, 16, 17, 15, 14, 14, 14, 15, 15, 14, 1...
## $ cylinders    <int> 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 4, 6, 6...
## $ displacement <dbl> 307, 350, 318, 304, 302, 429, 454, 440, 455, 390,...
## $ horsepower   <int> 130, 165, 150, 150, 140, 198, 220, 215, 225, 190,...
## $ weight       <dbl> 3504, 3693, 3436, 3433, 3449, 4341, 4354, 4312, 4...
## $ acceleration <dbl> 12.0, 11.5, 11.0, 12.0, 10.5, 10.0, 9.0, 8.5, 10....
## $ year         <int> 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 70, 7...
## $ origin       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 3, 1, 1...
## $ name         <chr> "chevrolet chevelle malibu", "buick skylark 320",...
head(df_auto)
## # A tibble: 6 x 9
##     mpg cylinders displacement horsepower weight acceleration  year origin
##   <dbl>     <int>        <dbl>      <int>  <dbl>        <dbl> <int>  <int>
## 1    18         8          307        130   3504         12.0    70      1
## 2    15         8          350        165   3693         11.5    70      1
## 3    18         8          318        150   3436         11.0    70      1
## 4    16         8          304        150   3433         12.0    70      1
## 5    17         8          302        140   3449         10.5    70      1
## 6    15         8          429        198   4341         10.0    70      1
## # ... with 1 more variables: name <chr>
names(df_auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"         "origin"      
## [9] "name"

2.3.5 Additional Graphical and Numerical Summaries

plot(df_auto$cylinders, df_auto$mpg)

  • I typically use the ggvis library for plots, or rcharts if I want javascript charts.
    • The base R plots are often used in older books like this, but it's worth learning the improved charting libraries.
  • I also never use attach because dplyr and ggvis let me reference the column name directly
df_auto %>%
  ggvis(~cylinders, ~mpg) %>%
  layer_points(
    fill := "green",
    fillOpacity := .4
    )
  • varwidth=T makes the box width based on the number of data points in the group, I think
    • That's one way to add more info to the chart, but it does look a little weird.
df_auto$cylinders <- as.factor(df_auto$cylinders)

plot(df_auto$cylinders, df_auto$mpg)

plot(df_auto$cylinders, df_auto$mpg, col = 'red', varwidth=T, horizontal=T)

plot(df_auto$cylinders, df_auto$mpg, col = 'red', varwidth=T, xlab="cylinders", ylab = "MPG")

  • It's still a good idea to make the cylinders variable a categorical variable rather than numerical, i.e. make it a factor.
    • This lets the boxplot skip 7, which has no values
df_auto %>%
  ggvis(~cylinders, ~mpg) %>%
  layer_boxplots(
    width = .9,
    fill := "green",
    fillOpacity := .4
    )
hist(df_auto$mpg)

hist(df_auto$mpg, col=2)

hist(df_auto$mpg, col=2, breaks=15)

attach(df_auto)
pairs(Auto)

pairs(~ mpg + displacement + horsepower + weight + acceleration)

  • I’m having trouble getting the identify function to work
    • I think the best way to do this is by adding a tooltip to the ggvis chart
plot(horsepower, mpg)
identify(horsepower, mpg, name)

## integer(0)
  • You can create a tooltip for the ggvis chart pretty easily
  • You just need a function that describes what to print and how
  • then you call that function in the add_tooltip function.
  • There are plenty of examples to follow
  • This only seems to work in an interactive context, so essentially it needs to use a shiny app
all_values <- function(x) {
  if(is.null(x)) return(NULL)
  str_c(names(x), ": ", format(x), collapse = '<br>')
}

df_auto %>%
  ggvis(~horsepower, ~mpg) %>%
  layer_points(
    fill := "green",
    fillOpacity := .6
  ) %>%
  add_tooltip(all_values, "hover")
summary(df_auto)
##       mpg        cylinders  displacement     horsepower        weight    
##  Min.   : 9.00   3:  4     Min.   : 68.0   Min.   : 46.0   Min.   :1613  
##  1st Qu.:17.00   4:199     1st Qu.:105.0   1st Qu.: 75.0   1st Qu.:2225  
##  Median :22.75   5:  3     Median :151.0   Median : 93.5   Median :2804  
##  Mean   :23.45   6: 83     Mean   :194.4   Mean   :104.5   Mean   :2978  
##  3rd Qu.:29.00   8:103     3rd Qu.:275.8   3rd Qu.:126.0   3rd Qu.:3615  
##  Max.   :46.60             Max.   :455.0   Max.   :230.0   Max.   :5140  
##   acceleration        year           origin          name          
##  Min.   : 8.00   Min.   :70.00   Min.   :1.000   Length:392        
##  1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000   Class :character  
##  Median :15.50   Median :76.00   Median :1.000   Mode  :character  
##  Mean   :15.54   Mean   :75.98   Mean   :1.577                     
##  3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000                     
##  Max.   :24.80   Max.   :82.00   Max.   :3.000
summary(df_auto$mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   17.00   22.75   23.45   29.00   46.60
  • since all of the analysis here is reproducible with the code, I don't save the workspace.
    • It might be nice to do this if there is data that took a long time to get, but in that case I would just save the data. It's all pretty fast to load once on the local machine.
  • I also don't use savehistory() or loadhistory() since I work in the editor, or in this case in an R notebook.
    • the author was using the R command line, so these would have been helpful there.
    • I would highly recommend using RStudio and rmarkdown files (or notebooks) over the R command line.
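For the rare slow-to-fetch object, a readr-style sketch of what I'd do instead of saving the workspace (the object name here is hypothetical):

```r
## Hypothetical: save just the expensive object, not the whole workspace
# write_rds(expensive_df, "data/expensive_df.rds")   # readr::write_rds
# expensive_df <- read_rds("data/expensive_df.rds")  # reload it later
```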

Thoughts

  • This section does introduce the reader to some interesting functions in R.
  • It's clear that this was written a while ago.
  • R has come a long way with data analysis functions, and I try to point to some of the more useful libraries here.
  • I’m looking forward to getting to the statistical learning.

2.4 Exercises

data(College)
head(College)
##                              Private Apps Accept Enroll Top10perc
## Abilene Christian University     Yes 1660   1232    721        23
## Adelphi University               Yes 2186   1924    512        16
## Adrian College                   Yes 1428   1097    336        22
## Agnes Scott College              Yes  417    349    137        60
## Alaska Pacific University        Yes  193    146     55        16
## Albertson College                Yes  587    479    158        38
##                              Top25perc F.Undergrad P.Undergrad Outstate
## Abilene Christian University        52        2885         537     7440
## Adelphi University                  29        2683        1227    12280
## Adrian College                      50        1036          99    11250
## Agnes Scott College                 89         510          63    12960
## Alaska Pacific University           44         249         869     7560
## Albertson College                   62         678          41    13500
##                              Room.Board Books Personal PhD Terminal
## Abilene Christian University       3300   450     2200  70       78
## Adelphi University                 6450   750     1500  29       30
## Adrian College                     3750   400     1165  53       66
## Agnes Scott College                5450   450      875  92       97
## Alaska Pacific University          4120   800     1500  76       72
## Albertson College                  3335   500      675  67       73
##                              S.F.Ratio perc.alumni Expend Grad.Rate
## Abilene Christian University      18.1          12   7041        60
## Adelphi University                12.2          16  10527        56
## Adrian College                    12.9          30   8735        54
## Agnes Scott College                7.7          37  19016        59
## Alaska Pacific University         11.9           2  10922        15
## Albertson College                  9.4          11   9727        55